Implement abortIsolate() in cloudflare:workers module #6237
Conversation
ResolveMessage: Cannot find module '@opencode-ai/plugin' from '/home/runner/work/workerd/workerd/.opencode/tools/bazel-deps.ts'

@hoodmane Bonk workflow failed. Check the logs for details. To retry, trigger Bonk again.
Force-pushed from ba459c8 to 1f9a0ea.

Force-pushed from f11cf40 to 2adf211.
Factor out the Isolate -> Script -> Worker creation pipeline from makeWorkerImpl() into a new Server::createWorker() method. This separates the worker creation logic (inspector policy, module registry, fallback service, artifact bundler, script compilation, worker construction) from the validation checks and post-creation wiring that are specific to makeWorkerImpl(). The experimental feature flag checks (NMR, module fallback, Python/NMR incompatibility) remain in makeWorkerImpl() since they only need to run once at initial config validation time. No behavioral changes.
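The shape of that split can be sketched roughly as follows. This is a TypeScript stand-in for the C++, not workerd's actual code; everything except the names createWorker and makeWorkerImpl is illustrative.

```typescript
// Hypothetical sketch: one-time validation stays in makeWorkerImpl(), while
// the reusable creation pipeline lives in createWorker() so it can be
// invoked again later without re-running the config checks.
interface WorkerDef {
  usesExperimentalFlag: boolean; // stand-in for NMR/module-fallback checks
  source: string;
}

function createWorker(def: WorkerDef): { source: string; compiled: boolean } {
  // ...inspector policy, module registry, script compilation, worker
  // construction would happen here...
  return { source: def.source, compiled: true };
}

function makeWorkerImpl(def: WorkerDef) {
  // Experimental feature-flag checks only need to run once, at initial
  // config validation time, so they stay outside createWorker().
  if (def.usesExperimentalFlag) {
    throw new Error("experimental flag required");
  }
  return createWorker(def);
}
```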
Force-pushed from 8769c83 to abcc698.
We need to be a bit careful with this as overuse could lead to excessive isolate creation, especially if requests are long-running (like WebSockets). For your use case, would it be OK if we actually terminated the worker, erroring in-flight requests? I would be more comfortable with that as it can't cause a build-up of condemned isolates so easily. If that works for you, I would call it
My thought was that this should behave in exactly the same way as when the worker allocates 2x the memory limit and gets condemned. That way, it doesn't offer any additional surface area to worry about.
Is that what allocating too much memory does? The current situation is that every subsequent request fails until the worker is retired, so whatever we do will be an improvement over that -- better for just the in-flight requests to error than for however many future requests to fail.
The way the memory limit works is that if the worker hits 1x the memory limit it is condemned, which allows it to complete the current work but will tear down the isolate once that work is complete. New requests will go to a fresh non-condemned isolate. If, while cleaning up, the condemned worker goes on to hit 2x the memory limit, it gets immediately destroyed. @kentonv's concern is that if we just have this condemn the isolate, we might have a case where, in the worst case, we have a ton of condemned isolates all just finishing processing of a single request and are constantly spinning up new ones to replace them, which causes a large amount of churn. Instead, this should work exactly like hitting the 2x limit: all activity being processed by the current isolate is interrupted immediately and the isolate is destroyed.
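The two-threshold policy described above can be modeled with a small sketch. This is illustrative TypeScript, not workerd's implementation; all names are made up.

```typescript
// Hypothetical model of the two memory thresholds described above:
// at 1x the limit the isolate is condemned (finishes current work, gets no
// new requests); at 2x it is destroyed immediately.
type IsolateState = "healthy" | "condemned" | "destroyed";

class IsolateLimiter {
  state: IsolateState = "healthy";

  constructor(private limit: number) {}

  // Called as the isolate's heap usage changes.
  onMemoryUsage(bytes: number): IsolateState {
    if (bytes >= 2 * this.limit) {
      // Hard kill: interrupt all in-flight work immediately.
      this.state = "destroyed";
    } else if (bytes >= this.limit && this.state === "healthy") {
      // Soft kill: let current work finish; new requests go to a fresh
      // isolate.
      this.state = "condemned";
    }
    return this.state;
  }
}
```

The PR's abortIsolate() is meant to behave like the 2x (hard-kill) branch, skipping the condemned phase entirely.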
In the Python use case the isolate won't be useful for anything anyways so it's definitely better to immediately destroy it and return errors. |
Add abortIsolate(reason) API that terminates the current JS isolate and creates a fresh one from scratch, resetting all module-level state. When called, it immediately terminates all in-flight requests on the isolate. The next request creates a new Worker. The abort-all mechanism uses a shared ForkedPromise that each IoContext subscribes to via onLimitsExceeded(). When abortIsolate() is called, the promise is rejected, causing every IoContext on the isolate to abort. This mirrors how the production 2x memory limit kill works. The reason string is included in all error messages across all aborted requests.
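A minimal model of the abort-all mechanism, using a plain rejected Promise as a stand-in for KJ's ForkedPromise. This is a sketch under those assumptions, not workerd's code; all names are illustrative.

```typescript
// Every request races its own work against a shared "abort" promise;
// rejecting that promise fails all of them with the same reason, after
// which a fresh promise is created for the next generation of requests.
class AbortAll {
  private rejectFn!: (err: Error) => void;
  private abortPromise!: Promise<never>;

  constructor() {
    this.reset();
  }

  private reset(): void {
    this.abortPromise = new Promise<never>((_, reject) => {
      this.rejectFn = reject;
    });
    this.abortPromise.catch(() => {}); // suppress unhandled-rejection noise
  }

  // Analogous to each IoContext subscribing via onLimitsExceeded(): the
  // request fails as soon as either its own work or the shared abort fires.
  run<T>(work: Promise<T>): Promise<T> {
    return Promise.race([work, this.abortPromise]);
  }

  // Analogous to abortIsolate(reason): fail every subscribed request with
  // the same reason, then reset so the recreated worker's requests get a
  // fresh, unrejected promise.
  abortIsolate(reason?: string): void {
    this.rejectFn(new Error(reason ?? "isolate aborted"));
    this.reset();
  }
}
```

With this model, every request that was pending when abortIsolate() fired rejects with the same reason string, while requests started after the reset subscribe to the new promise, mirroring the "next generation" behavior described above.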
Force-pushed from abcc698 to 3f64ddb.
Okay, updated the PR to kill all in-flight requests.
This seems to have a huge amount of churn in … Was this written by you or by AI?
Either way, you need a separate change in the internal codebase for this to be supported in production. (It should actually be a lot easier there, though, since runtime loading and reloading of Workers is the norm there.)
As to the latter part of this comment - I asked for local dev support here. |
export class ServiceStub {}

export function waitUntil(promise: Promise<unknown>): void;
export function abortIsolate(reason?: string): never;
I believe the return value is void?

Suggested change:

export function abortIsolate(reason?: string): void;
This function also needs to be added to the types folder (and later we need to run `just generate-types`).
`never` is correct since the function does not return.
containerEgressInterceptorImage(kj::mv(containerEgressInterceptorImageParam)),
isDynamic(isDynamic) {
  resetAbortAllPromise();
Would you mind adding a comment above this line?
if (workerFactory == kj::none) {
  JSG_FAIL_REQUIRE(Error, "abortIsolate() is not supported for this worker configuration.");
}

Nit, suggested change:

JSG_REQUIRE(workerFactory != kj::none, Error, "abortIsolate() is not supported for this worker configuration.");
OK, if we need local dev... I kind of wonder whether local dev could work by just aborting the process? (Unless it's a dynamic isolate, which should be easy to throw away so that it gets recreated next time.)
// Immediately terminate V8 execution so no further JS runs in this request.
// This raises an uncatchable exception in V8, causing the request to fail immediately.
jsg::Lock::from(v8::Isolate::GetCurrent()).terminateExecutionNow();
Don't use jsg::Lock::from(). Have the first argument to the method be jsg::Lock& js -- it'll be filled in by the JSG layer.
Can we convince Bonk to review this?
kj::Maybe<kj::String> moduleFallbackForFactory;
if (!isDynamicForFactory && isModulesSourceForFactory) {
  auto msg = kj::heap<capnp::MallocMessageBuilder>();
  msg->setRoot(def.featureFlags);
Copy seems unnecessary: Since it's not a dynamic worker, the def should be long-lived.
auto msg = kj::heap<capnp::MallocMessageBuilder>();
msg->setRoot(def.featureFlags);
featureFlagsMsgForFactory = kj::mv(msg);
ownedSourceForFactory = kj::heap<OwnedWorkerSource>(def.source);
This is copying the whole source code of the worker! Just to support this feature. It's probably unnecessary (see previous comment).
The changes to server.c++ are too deep to be trusted to AI, I think. Just spot checking, it's pretty "sloppy" (see above comments). This needs more than just a review loop where we all post comments; I think it needs a human engineer to go in and work on the code directly.
/bigbonk roast this pr |
Review findings
This is an AI-generated review and may contain inaccuracies. The findings below focus on correctness and safety issues in the implementation.
Summary
The overall design (abort-all via ForkedPromise, lazy recreation in startRequest()) is sound and mirrors the existing 2x-memory-limit kill path well. However, there are several correctness issues in the OwnedWorkerSource deep-copy, unresolved prior review comments, stale comments, and missing steps in the worker factory that need attention. See inline comments.
kentonv's concern about the amount of churn in server.c++ is well-founded -- the OwnedWorkerSource class and the deep-copy machinery add significant complexity for what is (by the PR's own admission) a workerd-only code path. The suggestion to explore simpler approaches (e.g., process restart for non-dynamic workers) deserves serious consideration before investing in fixing all the issues below.
// Immediately terminate V8 execution so no further JS runs in this request.
// This raises an uncatchable exception in V8, causing the request to fail immediately.
jsg::Lock::from(v8::Isolate::GetCurrent()).terminateExecutionNow();
[HIGH] Previously flagged issue not addressed: kentonv commented that jsg::Lock::from() should not be used — instead, the method should take jsg::Lock& js as its first parameter, which JSG fills in automatically.
The current code uses jsg::Lock::from(v8::Isolate::GetCurrent()) which is the discouraged pattern. The fix is straightforward:
js.terminateExecutionNow();

and the method signature (in the .h) should change to:

void abortIsolate(jsg::Lock& js, jsg::Optional<kj::String> reason);

JSG will automatically provide the jsg::Lock& argument -- it won't be visible to JavaScript callers.
// Throws away the current JS isolate and recreates the worker from scratch. The current
// request continues running on the old isolate; subsequent requests will use a fresh
// isolate with re-executed top-level module code and fresh global state.
[MEDIUM] Stale comment. This says "The current request continues running on the old isolate" but the actual implementation (per jasnell/kentonv's feedback) aborts ALL in-flight requests immediately. The comment should read something like:
Throws away the current JS isolate and recreates the worker from scratch. All in-flight requests on the isolate are immediately terminated. Subsequent requests will use a fresh isolate with re-executed top-level module code and fresh global state.
The same stale wording appears in io-channels.h:279-281.
explicit OwnedWorkerSource(const WorkerSource::ModulesSource& ms)
    : ownedMainModule(kj::str(ms.mainModule)),
      isPython(ms.isPython),
      ownedModuleData(cloneAll(ms.modules)),
      source(WorkerSource::ModulesSource{
        .mainModule = ownedMainModule,
        .modules = buildAll(ownedModuleData),
        .isPython = isPython,
[HIGH] OwnedWorkerSource drops several fields from ModulesSource:

1. `capnpSchemas` (`capnp::List<capnp::schema::Node>::Reader`) -- not copied. This is a non-owning capnp reader that points into the original config message. While Cap'n Proto modules are marked "TODO(someday)" for workerd, if someone does use a CapnpModule, this will be a dangling reader. At minimum, the field should be initialized to an empty list or an orphan copy.
2. `pythonMemorySnapshot` (`kj::Maybe<capnp::AnyStruct::Reader>`) -- not copied. If a Python worker uses `abortIsolate()`, the recreated worker will be missing its memory snapshot.
3. `CommonJsModule::namedExports` (`kj::Maybe<kj::Array<kj::StringPtr>>`) -- the CJS clone at line 4541 only copies `.body` but drops `namedExports`. These are non-owning `StringPtr`s into the body, so they'd need to be reconstructed after cloning the body.
4. `EsModule::ownBody` (`kj::Maybe<::rust::String>`) -- line 4535 copies from `es.body` (the `ArrayPtr<const char>`), which works if `ownBody` is `kj::none` (the body points into the capnp message). If `ownBody` is set (meaning the body was transpiled from Rust and points into `ownBody`), the copy via `kj::str(es.body)` still correctly captures the content, and the original `ownBody` is safely dropped since the clone owns the string. Just noting for completeness -- this case is actually handled correctly.

Items 1-3 are bugs.
if (!isDynamicForFactory && isModulesSourceForFactory) {
  auto msg = kj::heap<capnp::MallocMessageBuilder>();
  msg->setRoot(def.featureFlags);
  featureFlagsMsgForFactory = kj::mv(msg);
  ownedSourceForFactory = kj::heap<OwnedWorkerSource>(def.source);
[HIGH] Previously flagged issue not addressed: kentonv commented that the deep copy of the worker source code here is unnecessary. For a static (non-dynamic) worker, the capnp config message that contains the source lives for the entire lifetime of the Server. The `WorkerSource` borrows from it, and that borrow remains valid.
Rather than the entire `OwnedWorkerSource` machinery (~100 lines), you could just store a `const WorkerSource&` reference to `def.source` (or a copy of the lightweight `WorkerSource` struct itself, which only contains non-owning pointers). The source data it points to lives in the config message, which outlives the `WorkerService`.
If the concern is that `def` gets moved into `linkCallback`, then extract the `WorkerSource` reference before that move (as is already done for other fields). The underlying data remains valid regardless of where the `WorkerSource` struct lives.
This would eliminate the `OwnedWorkerSource` class entirely and address kentonv's concern about unnecessary copies of the entire source code.
auto newWorker = createWorker(workerName, source, featureFlags.asReader(), extensions,
    moduleFallbackPtr, kj::mv(compileBindings));

// Compile ctx.exports bindings.
newWorker->runInLockScope(Worker::Lock::TakeSynchronously(kj::none), [&](Worker::Lock& lock) {
  JSG_WITHIN_CONTEXT_SCOPE(lock, lock.getContext(), [&](jsg::Lock& js) {
    WorkerdApi::from(newWorker->getIsolate().getApi())
        .compileGlobals(lock, ctxExportsGlobals, ctxExportsHandle.getHandle(js), 1);
  });
  { auto drop = kj::mv(ctxExportsHandle); }
});

return newWorker;
[MEDIUM] The worker factory skips several steps that makeWorkerImpl() performs on the original worker:

1. `preloadPython()` (line 4735) -- not called in the factory. If this is a Python worker, the recreated worker won't have preloaded Python artifacts.
2. `validateHandlers()` (line 4774 in the original path) -- not called in the factory. Handler validation errors will be silently skipped on recreation.
3. Actor namespace initialization -- the factory calls `createWorker()` but doesn't call `initActorNamespaces()` or set up actor channels. If the worker has Durable Objects, the recreated worker may not have them properly configured. (Though the existing WorkerService already has the actor namespaces, so this might be OK if they don't need re-initialization -- worth verifying.)
4. Inspector registration -- `createWorker()` does register with the inspector (good), but if the old isolate's inspector session is still active, there could be confusion. Worth considering.

Items 1-2 are potential bugs for Python workers and workers with unusual handler configurations.
// Cloned env binding globals for use by abortIsolate() to recreate the worker with the same
// bindings. Only populated for static (non-dynamic) workers.
kj::Array<WorkerdApi::Global> envGlobalsForFactory;
[MEDIUM] This clones the env globals eagerly for every worker, even though it's only used when abortIsolate() is called (which most workers never do). This is the `envGlobalsForFactory` field.
Given that for static workers the `def` is moved into `linkCallback` anyway, and the underlying config data is long-lived, consider whether the globals could be cloned lazily (only on first `abortIsolate()` call) or borrowed from the config message. The `Global::clone()` does deep copies of strings and byte arrays for each binding, which is unnecessary work for the common case.
// If abortIsolate() was called, lazily recreate the worker now (outside any isolate lock).
if (isolateAbortRequested) {
  worker = KJ_ASSERT_NONNULL(workerFactory)();
[MEDIUM] Race window between abortIsolate() and startRequest(). Consider: request A calls abortIsolate(), which sets isolateAbortRequested = true and rejects the abort-all promise. Request B arrives and calls startRequest(), which sees the flag, calls workerFactory() to create a new worker, and resets the flag. Meanwhile, the abort-all rejection is still propagating to other in-flight requests on the old isolate.
The factory call (workerFactory()) is synchronous and happens on the event loop thread, so there's no actual concurrency issue here. But it's worth noting that the factory creates a new V8 isolate synchronously in startRequest(), which blocks the event loop. For workers with large codebases, this could cause noticeable latency on the first request after an abort.
Also: if workerFactory() throws (e.g., compilation error in the worker script), isolateAbortRequested remains true (since the assignment at line 2105 is after the factory call). This means subsequent requests will keep trying to recreate and failing. Consider catching the exception and logging it, or at least resetting the flag regardless.
export class ServiceStub {}

export function waitUntil(promise: Promise<unknown>): void;
export function abortIsolate(reason?: string): never;
[LOW] Previously flagged by anonrig: this function also needs to be added to the `types/` folder (with `just generate-types`). kentonv confirmed the `never` return type is correct (since the function does not return).
abortAllFulfiller->reject(JSG_KJ_EXCEPTION(FAILED, Error, kj::mv(message)));

// Create a fresh promise/fulfiller for the next generation of requests (after recreation).
resetAbortAllPromise();
[LOW] The abort-all mechanism looks correct: rejecting the fulfiller causes all `onLimitsExceeded()` branches to fire, which triggers `abortWhen()` -> `abort()` on each IoContext. The subsequent `resetAbortAllPromise()` creates a fresh pair for the next generation.
One subtlety: between rejecting the old fulfiller and calling `resetAbortAllPromise()`, if any code path called `onLimitsExceeded()` (e.g., a new IoContext being constructed), it would get a branch of the already-rejected promise and immediately abort. This is actually the desired behavior (we want requests on the old isolate to fail), but worth noting in a comment.
Review posted on PR #6237.

Add
abortIsolate(reason)API that terminates the current JS isolate andcreates a fresh one from scratch, resetting all module-level state. When
called, it immediately terminates all in-flight requests on the isolate.
The next request creates a new Worker.
The abort-all mechanism uses a shared
ForkedPromisethat eachIoContextsubscribes to via
onLimitsExceeded(). WhenabortIsolate()is called, thepromise is rejected, causing every
IoContexton the isolate to abort. Thismirrors how the production 2x memory limit kill works. The reason string is
included in all error messages across all aborted requests.